Using machine learning, create a model to predict the manner in which a subject performs an exercise.

The dataset contains measurements from accelerometers on the belt, arm, forearm and dumbbell of 6 different participants, each of whom performed the exercise in 5 different ways (classes A through E). This report summarizes the approaches taken to understand, explore and simplify the dataset, as well as the models fitted: it includes some exploratory data analysis of the training set, fits both a random forest and a gradient boosting model, cross-validates them on a held-out test set, and makes final predictions on the separate test set provided. The final model chosen was a random forest with close to 99% accuracy.
The data was first loaded into R. Any data marked as "NA", empty strings (""), and the Microsoft Excel expression "#DIV/0!" (indicating division by zero) were all set to NA. A random seed was set for reproducibility.
As this is an assignment, in the interests of time and sanity I will use a simple training vs. test set split for cross validation. In a real analysis I would ideally perform a more extensive cross validation (perhaps k-fold), but I am working on my laptop and only have so much patience and computing power, and I suspect random forests or bagging will give the best predictions, and those tend to take a long time to fit.
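For reference, the more extensive k-fold approach mentioned above could be configured in caret roughly as follows. This is a sketch only: the 5-fold choice and the `training3` object name are illustrative, and this was not run for the report.

```r
library(caret)

# Hypothetical 5-fold cross-validation setup (illustrative; this report
# uses a single 70/30 train/test split instead).
ctrl <- trainControl(method = "cv", number = 5)

# The model fit would then look something like:
# rf_cv <- train(classe ~ ., data = training3, method = "rf", trControl = ctrl)
```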
library(caret)
library(tidyverse)
library(factoextra)
library(corrplot)
library(reshape2)
library(ggpubr)
library(randomForest)
The data was read in and split randomly into two separate dataframes: 70% training and 30% test.
dat <- read.csv("pml-training.csv", na.strings = c("NA", "", "#DIV/0!"), row.names = 1)
set.seed(36362)
classe <- dat$classe
datpart <- createDataPartition(classe, p = 0.7, list = FALSE)
training <- dat[datpart,]
testing <- dat[-datpart,]
I then looked at the structure and the columns of the dataset (truncated for display). The first 6 variables appear to identify the subject (the user's name, the time they did the exercise, etc.). I very specifically do not want to use these, as my aim is to predict based on the movement; including the name (as you will see later) would make the model useless for any new subjects. I put the ID columns in their own dataframe, which includes all this information as well as the classe variable. I then removed any variables with close to zero variance and any columns containing NAs, bringing the number of potential predictors down to 52 from the original 159.
str(training, list.len = 10)
## 'data.frame': 13737 obs. of 159 variables:
## $ user_name : chr "carlitos" "carlitos" "carlitos" "carlitos" ...
## $ raw_timestamp_part_1 : int 1323084231 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 1323084232 ...
## $ raw_timestamp_part_2 : int 788290 120339 304277 484323 484434 528316 576390 604281 732306 740353 ...
## $ cvtd_timestamp : chr "05/12/2011 11:23" "05/12/2011 11:23" "05/12/2011 11:23" "05/12/2011 11:23" ...
## $ new_window : chr "no" "no" "no" "no" ...
## $ num_window : int 11 12 12 12 12 12 12 12 12 12 ...
## $ roll_belt : num 1.41 1.48 1.45 1.43 1.45 1.43 1.42 1.45 1.55 1.57 ...
## $ pitch_belt : num 8.07 8.05 8.06 8.16 8.17 8.18 8.21 8.2 8.08 8.06 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
## $ total_accel_belt : int 3 3 3 3 3 3 3 3 3 3 ...
## [list output truncated]
table(training$classe)
##
## A B C D E
## 3906 2658 2396 2252 2525
ID_training <- training[,c(1:6, ncol(training))]        # identifier columns plus classe
NZV <- nearZeroVar(training)                            # near-zero-variance columns
training2 <- training[,-c(1:4, 6, NZV, ncol(training))] # drop IDs, NZV columns and classe
ISNA <- is.na(colSums(training2))                       # flag columns containing NAs
training3 <- training2[,which(ISNA == FALSE)]           # keep only complete columns
training3$classe <- classe[datpart]                     # re-attach the outcome
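To make the filtering logic concrete, here is a minimal self-contained sketch of the same two steps (near-zero-variance removal, then the colSums() NA trick) on a toy dataframe; the column names are invented for illustration:

```r
library(caret)

set.seed(1)
toy <- data.frame(
  nzv_col = rep(0, 10),       # zero variance: flagged by nearZeroVar()
  na_col  = c(NA, rnorm(9)),  # contains an NA, so its colSums() result is NA
  signal  = rnorm(10)         # a usable predictor, kept
)

toy2 <- toy[, -nearZeroVar(toy), drop = FALSE]       # drop near-zero-variance columns
toy3 <- toy2[, !is.na(colSums(toy2)), drop = FALSE]  # drop any column containing NAs
names(toy3)  # only "signal" survives
```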
With the remaining variables, I made some plots to look at how they interact with each other and with the classe variable. First I plotted the density of the observations for each class, to see how the distributions look across all of the variables and whether there are any obvious differences. The figure is large, but we can clearly see that class A has differing distributions in a number of the variables, including some with large peaks compared to those of the other classes. This is true for the other classes too: the distributions vary. For example "E", which refers to throwing the hips forward, has a different distribution in many of the variables detected by the belt accelerometer.
melt.train <- cbind(user = ID_training$user_name, classe = ID_training$classe, training3) %>%
    melt(id.vars = c("user", "classe"))
ggplot(melt.train, aes(x = value, colour = classe)) +
    geom_density(lwd = 2) + facet_wrap(~variable, scales = "free", ncol = 6) + theme_bw() +
    theme(strip.text = element_text(size = 16))
I then computed a correlation matrix to see how the predictors correlate with each other.
cor.1 <- cor(training3[,-ncol(training3)])
corrplot(cor.1, type = "upper", order = "hclust", tl.cex = 0.8)
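A natural follow-up (not part of the original analysis) would be to let caret flag the highly correlated predictors directly. This is sketched here on the built-in mtcars data, since training3 is not reproduced in this snippet; the 0.9 cutoff is an illustrative choice.

```r
library(caret)

# Sketch: flag predictors whose pairwise absolute correlation exceeds 0.9,
# as candidates to drop (or to motivate PCA) before modelling.
cm <- cor(mtcars)
high <- findCorrelation(cm, cutoff = 0.9, names = TRUE)
high  # column names caret suggests removing
```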